ATM: Extract training data #11263

tiferet · 2022-11-14T22:58:10Z

Implement the new query that selects data for training.

For now we include clauses that implement logic that is identical to the old queries, so that the final dataset is identical, but we also note as TODO items some constraints we may want to experiment with removing. We include a temporary wrapper query that converts the resulting data into the format expected by the endpoint pipeline.

This PR moves the couple of small pieces of ExtractEndpointData that are still needed into ExtractEndpointDataTraining.qll, and deletes ExtractEndpointData.ql, ExtractEndpointData.qll, and the associated test files.

Timing checks:

✅ KPI timing experiment: https://github.com/github/codeql-dca-main/issues/8563
☑️ The local runtime of endpoint_large_scale/ExtractEndpointDataTraining is slightly impacted: Increased from about 4s to about 5s.

Closes https://github.com/github/ml-ql-adaptive-threat-modeling/issues/2098

Implement the new query that selects data for training. For now we include clauses that implement logic that is identical to the old queries. Include a temporary wrapper query that converts the resulting data into the format expected by the endpoint pipeline. Move the small pieces of `ExtractEndpointData` that are still needed into `ExtractEndpointDataTraining.qll`.

Also remove the associated test files.

javascript/ql/experimental/adaptivethreatmodeling/modelbuilding/DebugResultInclusion.ql

tiferet · 2022-11-15T01:23:42Z

If the tests pass (they've been running for almost an hour now) and the KPI experiment is OK, then this PR will be ready for review when you get to work on Tuesday.

tiferet · 2022-11-15T01:26:12Z

Nice -- the checks finally passed ✅

tiferet · 2022-11-15T03:01:24Z

@kaeluka The KPI timing experiment is OK, so this PR is now ready for review 🏓

kaeluka

👍 this looks good to me. I've left a few nitpicks (in other words: not necessary to address before merging — feel free to address in your next PR).

I've tried to use the suggestions feature where possible to make this as effortless as possible on your part.

The one choice you should make before merging is what to do about my suggestion to delete the logic backing up the hasFlowFromSource value. If you accept the suggestion, you'd also need to fix the .expected file. Alternatively, you can ignore that suggestion and merge as is, and I'll send a PR tomorrow that removes the flag.

.../experimental/adaptivethreatmodeling/modelbuilding/extraction/ExtractEndpointDataTraining.ql

...experimental/adaptivethreatmodeling/modelbuilding/extraction/ExtractEndpointDataTraining.qll

kaeluka · 2022-11-15T13:06:13Z

...experimental/adaptivethreatmodeling/modelbuilding/extraction/ExtractEndpointDataTraining.qll

+  exists(endpoint.getFile().getRelativePath()) and
+  // Only select endpoints that can be part of a tainted flow: Constant expressions always evaluate to a constant
+  // primitive value. Therefore they can't ever appear in an alert, making them less interesting training examples.
+  // TODO: Experiment with removing this requirement.


Suggested change

// TODO: Experiment with removing this requirement.

// TODO: Turn this requirement into a characteristic.

(nitpick)

..right?

Hmm, I suppose we could. Would that characteristic have any (positive or negative) implications for any class? Or are you suggesting adding it as a characteristic with no implications, simply so that the modeling code could use it in type balancing?

Either way it would need to be done as an experiment, because it could impact ATM metrics, so we'd have to check whether it helps or hurts them.

...experimental/adaptivethreatmodeling/modelbuilding/extraction/ExtractEndpointDataTraining.qll

tiferet · 2022-11-15T17:01:39Z

Meta comment: In all the cases above, my goal in this PR was to reproduce the current data exactly, so that we won't need to do end-to-end testing before merging this PR. Once this basic framework is merged, the next step would be to open a PR that adjusts the logic in all the ways we think it should be adjusted, then run end-to-end testing. If metrics aren't hurt we could merge that PR and run a partial update process. Until the orchestrator is fully functional, that's a non-trivial process, so I'd rather not do it many times unnecessarily. That's why I prefer a strict separation between PRs that change the framework but leave the data identical, and can be verified with PR checks, followed by a small number of targeted PRs that improve the underlying logic/data but require end-to-end testing.
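As an illustration of what "leave the data identical" means in practice, here is a minimal sketch of an order-insensitive comparison between the rows produced by the old and new extraction queries. It assumes both result sets have been exported as CSV text; the export format, the example columns, and the function names `load_rows`, `dataset_diff`, and `datasets_identical` are hypothetical and not part of this PR, which relies on the existing PR checks instead.

```python
import csv
import io
from collections import Counter


def load_rows(csv_text):
    """Parse CSV text into a multiset of rows, so row order does not matter."""
    reader = csv.reader(io.StringIO(csv_text))
    return Counter(tuple(row) for row in reader)


def dataset_diff(old_csv, new_csv):
    """Return (rows only in the old data, rows only in the new data)."""
    old_rows = load_rows(old_csv)
    new_rows = load_rows(new_csv)
    # Counter subtraction keeps only positive counts, so duplicates are respected.
    return old_rows - new_rows, new_rows - old_rows


def datasets_identical(old_csv, new_csv):
    """True when both exports contain exactly the same rows, in any order."""
    missing, extra = dataset_diff(old_csv, new_csv)
    return not missing and not extra


if __name__ == "__main__":
    # Hypothetical rows: file, line, label.
    old = "a.js,1,sink\nb.js,2,notsink\n"
    new = "b.js,2,notsink\na.js,1,sink\n"
    print(datasets_identical(old, new))
```

Using a multiset rather than a set means that a row appearing twice in one export and once in the other is still reported as a difference.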

Co-authored-by: Stephan Brandauer <kaeluka@github.com>

kaeluka

Still lgtm 👍

tiferet · 2022-11-15T19:55:26Z

> Still lgtm 👍

Thank you! I had some followup questions about some of your comments (e.g. #11263 (comment)), so let's keep discussing them in this PR even after it's merged. When we reach a conclusion I'll implement it in a subsequent PR.

tiferet added 2 commits November 14, 2022 14:33

Delete ExtractEndpointData.

b47723d

Also remove the associated test files.

github-actions bot added the ATM label Nov 14, 2022

github-advanced-security bot found potential problems Nov 14, 2022

javascript/ql/experimental/adaptivethreatmodeling/modelbuilding/DebugResultInclusion.ql Fixed

tiferet added 2 commits November 14, 2022 15:33

Fix import errors in DebugResultInclusion.ql

6b7612f

Fix non-ascii character in docs

9ecff07

tiferet marked this pull request as ready for review November 15, 2022 01:20

tiferet requested review from a team and kaeluka and removed request for a team November 15, 2022 01:20

owen-mc changed the title ~~Extract training data~~ ATM: Extract training data Nov 15, 2022

kaeluka previously approved these changes Nov 15, 2022

Apply suggestions from code review

092e019

Co-authored-by: Stephan Brandauer <kaeluka@github.com>

tiferet dismissed kaeluka’s stale review via 092e019 November 15, 2022 18:48

tiferet requested a review from kaeluka November 15, 2022 19:13

Apply suggestion from code review

fc078a4

kaeluka approved these changes Nov 15, 2022

tiferet merged commit 710b215 into main Nov 15, 2022

tiferet deleted the tiferet/extract-training-data branch November 15, 2022 20:08

This was referenced Nov 16, 2022

ATM: Implement the current endpoint filters as EndpointCharacteristics #11281

Merged

ATM: Remove redundant code #11321

Merged

ATM: Simplify query configurations #11323

Merged
